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Abstract 

An attempt to use phylogenetic invariants for tree reconstruction was made at 
the end of the 80s and the beginning of the 90s by several authors (the initial idea 
due to Lake |Lak87j and Cavender and Felsenstein |CF87| ). However, the efficiency 
of methods based on invariants is still in doubt f |Hue9 5 . JN90 ), probably because 
these methods only used few generators of the set of phylogenetic invariants. The 
method studied in this paper was first introduced in |CGS05j and it is the first 
method based on invariants that uses the whole set of generators for DNA data. The 
simulation studies performed in this paper prove that it is a very competitive and 
highly efficient phylogenetic reconstruction method, especially for non-homogeneous 
models on phylogenetic trees. 



Introduction 

Since the introduction of phylogenetic invariants by Cavender and Felsenstein JpF87 , 
Lake |Lak87| and Evans and Speed |ES93j several attempts to give a generating set of 
polynomial phylogenetic invariants have been made (see for example J5SEW93J, FS95J) 



but it has not been until recently that algebraic geometers have managed to find them all 
[AR04aJ, [SS05J, CS05J. Methods based on invariants have already proved to be useful 
in comparative genomics |SB99| ). However, a perception seems to have developed that 
invariants are inefficient, in the technical sense of requiring long sequences to correctly 
infer a phylogenetic tree, cf. JN90J, [Hue95j. While [Hue95j showed the inefficiency for the 



use of Lake's invariants alone, no method using all invariants had been proposed at that 
point, and the question for invariants-based methods in general was never investigated. 
Note that Lake's method of invariants only used two phylogenetic invariants of degree 
one among the 795 generators of the set of polynomial invariants of a quartet tree for the 
Kimura 2-parameter model |GP05j . But as Felsenstein explained, invariants are worth 
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more attention for what they might lead to in the future |Fel03j . This future may be 
soon, since the studies of this paper show a method based on invariants which is indeed 
promising. Recently, other methods based on a large set of invariants have also been 
considered |Kri05| . |KKPP06| . 

Phylogenetic invariants are relationships satisfied by the expected pattern frequencies 
occurring in sequences evolving along a given tree topology T under an evolutionary 
model. More precisely, if t is the set of d model parameters on T and p a (t) is the 
probability of observing the pattern a at the leaves of T, by letting t vary on an open 
subset of R d , the probability vector pit) = (paa...a{^), ■ ■ ■ ,Ptt...t(^)) defines a subset St 
of dimension < d of 1R 4 ™. A phylogenetic invariant is a real- valued continuous function 
f(x) on IR 4 ™ such that f{p) = for any p £ St, but not for all the points on the subset St> 
determined by another tree topology T' . Essentially, the equations f(j>) = are satisfied 
for pattern frequencies arising from any model parameters on a fixed tree, so they might 
be used for recovering the tree topology. 

In practice, the vector of observed pattern frequencies p obtained from an alignment 
of sequences for n taxa with enough data, should approximate p(t) for some set of pa- 
rameters t on a tree topology T. In other words, p should be a point close to the subset 
St so, if / is an invariant for the topology T, one should have that f{p) is very close to 
0. As the tree topology is identifiable via invariant-based methods (AR06j . using phylo- 
genetic invariants for tree reconstruction is a consistent method (see |HL00j . |Fel03| ) . A 
practical introduction to the theory of invariants can be found in the book of J. Felsen- 
stein |Fel03| chapter 22], whereas the book |PS05j provides a beautiful insight into the 
applications of algebraic statistics (and in particular polynomial phylogenetic invariants) 
to computational biology. 

There are two major motivations for using phylogenetic invariants in tree reconstruc- 
tion. One of them is the prohibitive computational expense of a full maximum likeli- 
hood estimation of a tree, its edge lengths, a base distribution, and a rate matrix. The 
other is that the evolutionary models underlying the theory of invariants allow for non- 
homogeneous mutation. Indeed, it is known that, for some biological data sets, different 
rate matrices should be allowed in different lineages. Thus it is essential to have at our dis- 
posal phylogenetic methods for reconstructing trees admitting non-homogeneous models 
[YY99] . |OQ98| . 

In this paper, a phylogenetic reconstruction method that uses polynomial phylogenetic 
invariants (introduced in |CGS05| ) is studied and tested for quartet trees evolving under 
the Kimura 3-parameter model of nucleotide substitution |Kim81j . Actually, we consider 
an algebraic Kimura model: the parameters of the model are the entries of the substi- 
tution matrices on the edges (and not a single rate matrix together with edge lengths). 
Hence the model is non-homogenous — because it allows different rate matrices among the 
edges — but it is stationary (and the distribution of the bases is uniform), and we always 
assume that all sites are independent and identically distributed (i.i.d. hypotheses). We 
performed simulation studies to test its efficiency. One of our approaches to evaluate the 
performance and efficiency of the method is taken from Huelsenbeck jHue95j so that a 
large portion of the tree space is examined to get a general idea of how the algorithm 
performs. We present the results obtained for sequences of length 100 up to 10000. 
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We also checked the performance of the method on simulated data from non-homoge- 
neous models by carrying out a comparison to Neighbor- Joining algorithm J3N87J, max- 
imum likelihood algorithm |Fel81j . and an algorithm for a non-homogeneous model from 
PAML [Yan97] for sequences generated under a Kimura 2-parameter model [Kim80j and 
different rate matrices along different tree branches. 



Results 



Homogeneous data 

The performance of the invariants method studied here on homogeneous Kimura models 
can be seen in the figure [TJ Using the approach of J. P. Huelsenbeck in [Hue95j for quartet 
trees, we considered two branch-length parameters a, b and simulated data for each pair of 
lengths. Parameter a assigns the branch length to the internal branch and two opposite 
peripheral branches, and parameter b assigns the branch length to the two remaining 
branches. Parameters a and b were varied from 0.01 to 0.75 in increments of 0.02, and 
1000 alignments were simulated for each couple (a, b) (see figures 2 and 3 in }Hue95j . or 
figure I in the Supplementary Material for a clear picture of this space of parameters). 
The simulated trees evolve under the Kimura 3-parameter model |Kim81j with a fixed 
rate matrix of the form 
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along the tree (• = —a — (3 — 7). Figure ^ shows the efficiency of the method considered 
in this paper for a rate matrix with parameters 7 = 0.1, a = 3.0,/? = 0.5 (hence, a 5:1 
transitiomtransversion bias) and for nucleotide sequences of lengths 100, 500, 1000 and 
10000. See also figures III and IV of the Supplementary Material for studies performed 
with other rate matrices. 

This figure is to be compared with those shown in Figure A2 of |Hue95j (corresponding 
to phylogenetic inference for sequences generated under a Kimura 2-parameter model of 
substitution |Kim80j with 5:1 transitiomtransversion bias). Though this is of course a 
biased comparison because our method admits non-homogeneous data and Kimura 3- 
parameter model, it is worth noticing that even on data from homogeneous simulations 
our method outperforms many of the methods considered there. In particular, it is clearly 
better than Lake's invariant method (which is not surprising because Lake's method only 
used linear invariants). At least for sequences of length 500 or larger, our method performs 
better than neighbor-joining — referred as Minimum Evolution (Kimurai) in figure A2 of 
Huelsenbeck's paper. Notice also that the shape of the 95%-isocline for lengths around 
500 or larger is quite different from the corresponding shapes in the methods tested by 
Huelsenbeck (see his figure 7): for large values of a (near 0.7), the performance of our 
method of invariants does not drop drastically (as it does for most methods considered 
there). Therefore for these values our invariants method outperforms all the methods 
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Length=100 Length=500 




0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 

parameter a parameter a 



Figure 1: The graphics above represent the probability of reconstructing the correct tree in the 
parameter space (see Figure I in the Supplementary Material). Black areas correspond to couples 
(a, b) for which the tree was correctly estimated, while white regions correspond to couples (a, b) 
for which the tree is never estimated; grey tones indicate areas of intermediate probability. The 
95% isocline is drawn in white, while the 33% is drawn in black. The four graphics above 
show the results obtained under the homogeneous Kimura 3-parameter model when using a rate 
matrix with parameters 7 = 0.1, a = 3.0, (3 = 0.5 (hence a 5:1 transitiomtransversion bias). 
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studied in |Hue95j . As it can be seen in figures III and IV in the Supplementary Material, 
it seems that the method performs better for small transition: transversion ratios. 

For length 1000, the efficiency of the invariants method considered here is similar to 
that obtained for lengths > 10000 in many of the methods tested in |Hue95j . From this 
it can be inferred that, in order to reconstruct the correct tree, much less data is needed 
in the invariants method presented here than in many other methods (contrary to what 
was thought until now [HL00J). 

Non-homogeneous data 

We tested the invariants method studied in this paper on data simulated according to a 
non- homogeneous Kimura model by comparing it with other methods. 

1. Comparison with neighbor-joining 

First of all we compared the performance of the invariants method presented here with 
neighbor-j oining (the algorithm of [SN87 ) using Kimura 3-parameter distance |Kim81j 
. As it can be seen in figure El considering certain non-homogeneous sets of simulated 
data, we found that the invariants method is more efficient than neighbor-j oining. Indeed, 




Figure 2: Performance of neighbor-joining and the invariants method on non-homogeneous 
data for sequence length varying from 100 to 3000 in intervals of 100. For each length, 100 sets 
of data were generated. The non-homogeneous tree used for simulations is described in figureOl 

we simulated data on an unrooted quartet tree evolving under the Kimura 3-parameter 
model where different rate matrices at each edge were chosen (see figure Efl)) and we 
studied the efficiency of the method when varying the length of nucleotide sequences. In 
this case, the mean of correctly reconstructed trees for the invariants method is 90.2% 
whereas the mean for neighbor-joining is 84%. It is worth pointing out that the use of 
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Kimura distance is still justified in the non-homogeneous setting, so that it is fair to 
compare both methods. For this particular tree, the maximum likelihood algorithm for 
the Kimura 3-parameter model reconstructed the tree correctly almost all times, so we 
do not include the results here. 



/ 




Figure 3: A non-homogeneous tree used for simulations. The numbers labelling each edge 
correspond to branch lengths (i.e. expected percent change between the two taxa on the edge). 
(1) In comparing our method with neighbor-joining, the parameters aj,/?j,7i in the Kimura 
3-parameter rate matrices Qi were chosen as: 71 = l,ai = 4, /3i = 1, 72 = 5, 02 = 14, (h = 3 
,73 = 4,a 3 = 15, /? 3 = 3, 74 = 2,a 4 = 6,/3 4 = 2, 75 = 2,a 5 = 3,/3 5 = 1. (2) In comparing our 
method with maximum likelihood algorithm based on the Kimura 2-parameter model, we chose 
7 = (5 = 1 in all rate matrices whereas parameter a was set as follows: in Q\, a = 4, in Q2, Q4 
and Q5, a = 3 + e 2 , in Q3, a = 3 + e. When e = 1 the tree is homogeneous and we studied the 
efficiency of three methods as e increases up to 9. 



2. Comparison with maximum likelihood 

We compared the invariants method with two versions of maximum-likelihood: the usual 
maximum likelihood for Kimura 2-parameter |Kim80j and non-homogeneous maximum 
likelihood method developed in the package PAML |Yan97j for Kimura 2-parameter model 
(namely, the option nhomo). This last method allows different transition/transversion 
ratio in different tree branches. To perform this comparison we simulated data according 
to the tree in figure Ef2) evolving under the Kimura 2-parameter model. When e = 1 the 
tree is homogeneous and we studied the efficiency of the three methods as e increases up 
to 9. Figure 0] summarizes the results relative to the comparison between our method, 
the non-homogeneous method in PAML for Kimura 2-parameter model (option nhomo=2 in 
the baseml control file) and the maximum likelihood homogeneous in PAML for Kimura 2- 
parameter (option nhomo=0). It shows the percentage of correctly reconstructed trees for 
each value of parameter e. As expected, the homogeneous maximum-likelihood algorithm 
becomes less efficient as e increases. The use of a method assuming homogeneity on data 
produced in accord with a non-homogeneous model is, of course, not recommended as 
such model misspecification is known to lead to unreliable results. Already for e = 5, the 
invariants method overtakes the homogeneous maximum likelihood algorithm. As it is 
deduced from figure EJ our method performs worse than the non-homogeneous algorithm 
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Comparison with Maximum Likelihood 




E — Invariants 

3 — PAML non homogeneous 

_ - - - PAML homogeneous 

H i i i i r i i H 

1 2 3 4 5 6 7 8 9 
parameter e 

Figure 4: The effect of rate heterogeneity among lineages on method performance. The sim- 
ulated nucleotide sequences have length 1000 and evolve under the non-homogeneous Kimura 

2- parameter model as described in figure 01 (homogeneous for e = 1). The trees were recon- 
structed with our method (invariants), with PAML under a non-homogeneous method (PAML 
non-homogeneous) and the usual PAML maximum likelihood homogeneous (PAML homogeneous). 
For each e 100 sets of data were generated. 

in PAML. However, it is worth noticing that in this test we generated data according 
to Kimura 2-parameter model and the invariants method presented here was developed 
under Kimura 3-parameter model. Similar results are obtained when the data evolve 
under Kimura 3-parameter model (see figure II in the Supplementary Material). 

Methods 

The phylogenetic reconstruction method used in this paper is based on phylogenetic in- 
variants and was first introduced by the first author and L.D. Garcia and S. Sullivant in 

The taxa are given by an alignment of n DNA sequences of length N. A Markov 
process along an unrooted binary tree of n species is assumed and we consider that all 
sites are independently and identically distributed (i.i.d. hypotheses). The parameters 
of the model we are considering are the entries of the substitution matrices of a Kimura 

3- parameter model |Kim81j . (we should rather speak of an algebraic Kimura 3-parameter 
model, according to the book [PS05, chapters 1 and 4]). Note that in the usual Kimura 
3-parameter model, a rate matrix Q is fixed and common to the whole tree and the substi- 
tution matrix is the exponential e™' (for some parameter representing time). However, 
in our method we do not make use of rate matrices Q — we only use the substitution 
matrices — so, as we will see later, the rate matrices might vary among the edges. It is 
worth pointing out that, as we consider a Kimura 3-parameter model along all branches, 
the uniform distribution of base composition holds for the whole tree (stationarity hy- 
pothesis). 
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Phylogenetic invariants 



Sturmfels and Sullivant [SS05J gave an explicit description of the generators of the set 
of polynomial phylogenetic invariants I(T) for an arbitrary tree evolving under a group- 
based model. For the Kimura 3-parameter model on an unrooted 4-taxa tree the ideal of 
phylogenetic invariants has 8002 minimal generators (see the webpage GP05J, [CGS05J 
and the discussion in |SS05j , section 7 to see why a smaller subset of invariants does not 



suffice). According to the results in [SSQjy, we produced this generating set for an unrooted 
tree with 4 leaves under the Kimura 3-parameter model. This requires doing a Fourier 
transform (or Hadamard conjugation) on the vector of probabilities (pa...a, ■ ■ ■ ,Pt...t) and 
the phylogenetic invariants are then described as binomials in the Fourier coordinates. 



Algorithm 

Our tree reconstruction algorithm performs the following tasks. Given 4 aligned DNA 
sequences s±, S2, S3, S4, it first computes the observed relative frequencies of each pattern 
for the topology ((si, s 2 ), s 3 , s 4 ) on an unrooted quartet tree. Then it transforms these 
relative frequencies to Fourier coordinates. From this, we compute the Fourier coordinates 
in the other two possible topologies for unrooted trees with 4 species. We then evaluate all 
phylogenetic invariants for the Kimura 3-parameter model in the Fourier coordinates of 
each tree topology. We call sj the absolute value of this evaluation for the polynomial / 
and tree topology T. From these values {sJ} /; we produce a score for each tree topology 
T, namely s(T) = |sj|. The algorithm then chooses the topology that corresponds to 
the minimum score. The code was written in PERL and is available upon request. For an 
alignment of 4 sequences of 1000 nucleotides it takes 0.35s on a single 3.0-Ghz processor. 



Software used 

The simulations of this study were obtained using the program Seq-Gen vl.3.2 [RG97 . 
We used the algorithms implemented in the package APE |PSG + 06] vl.8-3 of R |R D05j 
v2.1.1 to compute Kimura 3-parameter distance KiniH 1 j and perform neighbor-joining 
algorithm |SN87j . The package PAML |Yan97j was used for phylogenetic inference involving 
maximum likelihood methods. 



Discussion 

The simulation studies performed in this paper present a very competitive phylogenetic 
reconstruction method based on invariants. If one compares our results on the tree space 
and those of Huelsenbeck |Hue95j one sees that the method presented here is highly 
efficient. Note that this might be a biased comparison as in both papers homogeneous 
models are used for sequence generation and, while he uses homogeneous methods for 
inference, our invariant method allows non-homogeneous models — remember that one 
generally obtains the best results using the most restricted model that fits the data. 
There are some limitations of the tree space study performed here, though. For example, 
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this tree space does not consider trees where the inner edge is extremely small or extremely 
large with respect to the peripheral branches. As Huelsenbeck points out, the usefulness 
of considering this parameter space can be questioned, but he also gives strong arguments 
that convinced us to work in this parameter space. Moreover, considering the same 
parameter space as Huelsenbeck allows one to compare our results to the other methods 
studied by him. 

We would like to comment on the algorithm presented here and, more generally, on all 
methods based on invariants. First of all we need to emphasize that the computation of 
the phylogenetic invariants of a given model just need to be computed once, so their com- 
putation need not contribute to the running-time of an algorithm using them. Secondly, 
increasing the size of the sequences does not drop the computational efficiency of the algo- 
rithm. Indeed, the sequences length only accounts for computing the relative frequencies 
of the observed patterns (which is something that most algorithms based on evolutionary 
models must do), but it does not participate in any other part of the algorithm. 

A small comment on the election of the 1-norm: we performed simulation studies not 
presented here to prove that the algorithm performs clearly better with the 1-norm than 
with the maximum norm, and slightly better than with the euclidean norm 1 . Another 
consideration that might be important for the computational efficiency of the method is 
that, in Fourier coordinates, the polynomials considered here are binomials and hence 
they are easy to evaluate at a given point (so there is no need to worry computationally 
about the evaluation of the polynomials). Moreover, as it is proved in SSO-l . these 
binomials have degree 4 at most so, again, the computational cost is low. Choosing 
another generating set of invariants or a different weightening of the polynomials will lead 
to other results, so this is an issue that should certainly be studied in the future. 

We implemented and tested the algorithm presented here only on 4-taxon trees. This 
seems a limitation of the method but as the reader may have noticed, the method is 
universal and could be used to infer the topology of trees with arbitrary number of taxa. 
However, the computational demands of deducing large phylogenies led us not to develop 
this algorithm for larger trees. Instead, we suggest that invariants might be a good 
starting point for quartet methods of phylogenetic inference. In this direction, it is also 
worth thinking about new tree reconstruction algorithms for arbitrary taxa based on 
invariants (this is something the authors will surely work on in the future). 

In this paper we focused on the Kimura 3-parameter evolutionary model. However, a 
full generating set of invariants is known for any group-based model ( |SS05| ). a large set of 
invariants is known for the general Markov model [X*R03j, AR04a 3 and for a strand sym- 
metric model |CS05j , and some invariants are already known for certain rate-class models 
AR06]. Therefore, the method presented here can be extrapolated to these evolutionary 
models and in the future further models can be considered. In particular, invariants might 
also perform well for models allowing site-to-site variation. 

As we pointed out in the introduction, invariants based methods focus on recovering 
the tree topology and not estimating the parameters. Nevertheless, as Allman and Rhodes 

lr rhe 1-norm of a vector x = (x\, . . . , x n ) is ||x||i = X)"=i \ x i\i the euclidean norm is ||x||2 = Vz3=i x f 
and the maximum norm is ||x||oo — max^ |xj|. 
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say ( [A R03j . |AR04bj ) . it is fair to think that phylogenetic invariants may also be useful 
for parameter recovery. 

As we have already mentioned, one of the advantages of the method presented here 
versus other methods of phylogenetic reconstruction based on evolutionary models is 
that the model considered here is non-homogeneous in the sense that the rate matrix is 
allowed to vary along the different branches of the tree. However, as we are assuming 
a Kimura model, the base distribution is the same at all nodes of the tree and so the 
model is stationary (note that invariants methods based on other evolutionary models 
would not require stationarity of base distribution). For an unrooted tree with n taxa, 
our algebraic Kimura model has 3(2n — 3) free parameters, and it is a special case of 
the general Markov model which involves 12 (2n — 3) + 3 parameters. If one considered a 
Kimura 3-parameter model (not the algebraic model considered here) allowing different 
rate matrices along the tree one would have 3(2n — 3) + In — 4 (the extra parameters 
correspond to time parameters). Other non-homogeneous models have been considered 
in the literature, see for example |GG98| and |YR95j . The emphasis in these two papers 
is put on the non-stationarity hypothesis, and the maximum likelihood approach taken 
in most non-homogeneous methods makes them computationally unfeasible. 

Supplementary material 

Supplementary material includes figures I, II, III and IV. 
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Figure 5: (a) Unrooted 4-taxa tree. The branches labelled with the same letter (either a or 
b) have the same branch length. The taxa are named with arabic numbers, 1 to 4. (b) The 
parameter space for the unrooted four-taxon tree. Parameter a is plotted in the abscissa, while 
parameter b in the ordinate. The lengths of the branches a and b are varied from 0.01 to 0.75 
in increments of 2%. 
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Figure 6: The effect of rate heterogeneity among lineages on method performance. The sim- 
ulated nucleotide sequences have length 1000 and evolve under the non-homogeneous Kimura 
3-parameter model for the parameters 7 = 1, (3 = 2 in all rate matrices while parameter a was 
set as follows: in Qi, a = 4, in Q2, Q4 and Q5, a = 3 + e 2 , in Q3, a = 3 + e (see figure 3). 
The trees were reconstructed with our method (invariants), and with the usual PAML maxi- 
mum likelihood homogeneous (PAML homogeneous). For each value of e, 100 sets of data were 
generated. 
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Figure 7: The graphics above represent the probability of reconstructing the correct tree in the 
parameter space (see Figure I). Black areas correspond to couples (a, b) for which the tree was 
correctly estimated, while white regions correspond to couples (a, b) for which the tree is never 
estimated; grey tones indicate areas of intermediate probability The 95% isocline is drawn in 
white, while the 33% is drawn in black. The four graphics above show the results obtained 
under the homogeneous Kimura 3-parameter model when using a rate matrix with parameters 
7 = 0.2, a = 0.5, (3 = 0.3 (hence a 1:1 transitiomtransversion bias). 
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Figure 8: The four graphics above show the results obtained in the four-taxon space when the se- 
quences are generated under the homogeneous Kimura 3-parameter model of substitution using a 
rate matrix with parameters 7 = 0.12, a = 0.73, (5 = 0.13 (hence a 2.93:1 transitiomtransversion 
bias). This rate matrix corresponds to the maximum likelihood estimate from an alignment of 
homologous sequences of eight vertebrates under a Kimura 3-parameter model. 
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